Characterization of Blood Group Variants in an Omani Population by Comparison of Whole Genome Sequencing and Serology

Although blood group variation was first described over a century ago, our understanding of the genetic variation affecting antigenic expression on the red blood cell surface in many populations is lacking. This deficit limits the ability to accurately type patients, especially as serological testing is not available for all described blood groups, and targeted genotyping panels may lack rare or population-specific variants. Here, we perform serological assays across 24 antigens and whole genome sequencing on 100 Omanis, a population underrepresented in genomic databases. We inferred blood group phenotypes using the most commonly typed genetic variants. The comparison of serological to inferred phenotypes resulted in an average concordance of 96.9%. Among the 22 discordances, we identify seven known variants in four blood groups that, to our knowledge, have not been previously reported in Omanis. Incorporating these variants for phenotype inference, concordance increases to 98.8%. Additionally, we describe five candidate variants in the Lewis, Lutheran, MNS, and P1 blood groups that may affect antigenic expression, although further functional confirmation is required. Notably, we identify several blood group alleles most common in African populations, likely introduced to Oman by gene flow over the last thousand years. These findings highlight the need to evaluate individual populations and their population history when considering variants to include in genotype panels for blood group typing. This research will inform future work in blood banks and transfusion services.


Introduction
In the 124 years since the discovery of the ABO blood group system, 45 different blood groups have been described in humans along with 50 associated blood group genes 1. Despite extensive knowledge of these various blood group systems, most have been described via case studies, and recent population genomic analyses suggest there is much left to be discovered. For instance, a study of genomic variation in African, European, South Asian, East Asian, and American populations from the 1000 Genomes Project 2 identified 1,241 nonsynonymous (NS) variants within 43 blood group genes, and reported that 1,000 of the NS variants (81%) were not known blood group polymorphisms, yet 357 were extracellular and thus potentially antigenic 3.
Another study of the same dataset identified only 120 of 604 known blood group variants 4, with 36 of these found in at least one continental region where they had not previously been described 4. This suggests that many known variants affecting blood group variation are rare, and that we still lack a complete understanding of the distribution of blood group alleles in many populations 4.
For example, a recent study from Oman demonstrated that rare, undescribed variants likely affect antigenic expression in multiple blood groups 5. In this study, targeted genotyping and serological assays were compared for 19 different antigens belonging to six blood group systems.
Although overall concordance was high (>95%), Fy b was an exception (concordance 87%), and only 3 antigens had 100% concordance 5. Discordances were likely due to the effect of genetic variants that were absent from the genotyping assay. While the prevalence of common blood group antigens or blood group alleles has been documented across much of the Arabian Peninsula 6,7,8-11,12, no comparison of sequencing data and serology has been conducted to identify additional variants affecting antigenic expression in this region.
Targeted genotyping is increasingly being investigated as an alternative or complement to serology for blood group phenotyping 13. Benefits of genotyping include improved red cell matching for multi-transfused patients, for those at an increased risk of alloimmunization, such as patients with sickle cell disease, and for those with autoantibodies. The addition of a genotyping strategy is of interest in Oman, where hemoglobinopathies and risk of alloimmunization are common in the population [14][15][16]. However, a more complete picture of rare and population-specific variants affecting antigenic expression is necessary. Whole genome sequencing provides a comprehensive view of blood group loci, including indels and copy number variation, particularly at more complex loci such as those that determine the RH and MNS blood groups 17,18.
Here, we compare antigen typing inferred by whole genome sequencing to antigen expression determined by serology to identify variants contributing to blood group variation in the Omani population. We identified seven variants that have not previously been described in Omanis. Additionally, we identified five candidate variants that may be affecting antigen expression in the Lewis, Lutheran, MNS, and P1 blood groups by altering erythrocyte-specific transcription factor binding sites, or by altering the coding regions near alleles encoding the blood group antigens. These findings should be considered when selecting red cell genotyping platforms for blood banks and transfusion services in Oman and nearby regions.

Sample Collection
A description of the samples used in this analysis, including DNA extraction and shipping conditions, has been previously published 19. Briefly, 100 healthy male and female Omani blood donors between the ages of 18 and 60 years attending the Sultan Qaboos University Hospital (SQUH) blood bank were randomly selected and consented for enrollment in the study. The Medical Research Ethics Committee at the College of Medicine and Health Sciences, Sultan Qaboos University, approved this study (MREC #2034, 2019).

Blood Bank Methods
Red blood cell phenotyping was performed within 24 hours of collection at the SQUH blood bank using BioRad© antisera on freshly drawn samples per the manufacturer's instructions (BioRad©, Cressier, Switzerland) and as previously published 12. The following blood group systems and antigens were tested: ABO (A, B antigens), Rh (C, c, E, e antigens), Kell (K, k, Kp a, Kp b antigens), Kidd (Jk a, Jk b antigens), Duffy (Fy a, Fy b antigens), Lewis (Le a, Le b antigens), Lutheran (Lu a, Lu b antigens), and MNS (M, N, S, s antigens). A clear red cell button at the bottom of the phenotyping well was defined as a negative reaction for all antigens (grade 0). Rh D reactions of 0 or 1 were further tested for weak D. Reactions positive for weak D were reported as Rh D positive and reactions negative for weak D testing were reported as Rh D negative, as per the manufacturer's instructions. Other reaction patterns were defined as positive and were graded (1-4) for each antigen phenotyped. We included known positive and negative samples as internal controls for each antigen.

Genome Sequencing, Alignment and Variant Calling
As previously described 19, short-read (150 bp paired-end) whole genome sequencing was performed to an average coverage of 16X at the Huntsman Cancer Institute High-Throughput Genomics Shared Resource at the University of Utah. The sequence reads were aligned to GRCh38 with BWA-MEM 20 and variants were called following the GATK best practices protocol 21,22.
Haplotype phasing was done using Eagle v2 23 to produce a haplotype variant call file.

Inferring Blood Group Phenotypes

ABO, RHCE, Kell, Kidd, Duffy, Lewis, Lutheran, MNS, and P1 Inference with SNVs
Using the databases available from ISBT 1 and BloodAntigens.com 13, we curated a list of variants for inferring blood group phenotypes from SNV genotypes (Table 1).

RHD and RHCE Copy Number Inference
RHD phenotypes were inferred using a copy number analysis 13. Using aligned reads filtered for MAPQ > 20, coverage across the RHD locus (1:25272393-25330445) and RHCE locus (1:25360659-25430193) was calculated using SAMtools 24. Using the equation described by Lane et al. 2018 13, a ratio of RHD to RHCE coverage of 0-0.5 was classified as null, 0.6-1.5 as hemizygous, and 1.6-2.5 as homozygous. RHCE C/c antigen phenotypes were also inferred using a copy number analysis suggested by Lane et al. 13, comparing coverage of RHCE exon 2 to that of the entire RHCE locus. A coverage ratio greater than or equal to 1.5 was inferred as C-c+, 0.5-1.4 as C+c+, and less than 0.5 as C+c-.
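The binning rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the ratio is taken as already computed (the equation from Lane et al. is not reproduced here), and ratios are rounded to one decimal place so that values falling between the published ranges (for example, between 0.5 and 0.6) are assigned to a bin.

```python
def classify_rhd_zygosity(ratio: float) -> str:
    """Bin an RHD:RHCE coverage ratio into an RHD copy-number class."""
    if ratio < 0:
        raise ValueError("coverage ratio cannot be negative")
    r = round(ratio, 1)  # assumption: ratios are compared at one decimal place
    if r <= 0.5:
        return "null"
    if r <= 1.5:
        return "hemizygous"
    if r <= 2.5:
        return "homozygous"
    return "unclassified"  # outside the ranges stated in the text


def classify_c_antigen(ratio: float) -> str:
    """Bin an RHCE exon 2 : RHCE locus coverage ratio into a C/c phenotype."""
    if ratio < 0:
        raise ValueError("coverage ratio cannot be negative")
    r = round(ratio, 1)  # same one-decimal assumption as above
    if r >= 1.5:
        return "C-c+"
    if r >= 0.5:
        return "C+c+"
    return "C+c-"
```

Returning "unclassified" for ratios above 2.5 is a design choice of this sketch; the text does not say how such values were handled.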

MNS Copy Number Inference
To identify copy number variation, we inferred the underlying copy number state from observed coverage at sites with high mappability in 1,600 bp windows across the GYPA, GYPB, and GYPE loci using a hidden Markov model as previously described 19,25.
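To make the idea of decoding copy number from windowed coverage concrete, the following is a minimal Viterbi sketch. It is not the model used in this study, whose details are given in the cited references. The Poisson emission model, the copy-number states 0 to 4, the uniform start probabilities, the `stay` probability, and all function names are assumptions chosen for illustration.

```python
import math


def log_poisson(k: float, lam: float) -> float:
    """Log probability of observing coverage k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_copy_number(window_cov, diploid_mean, states=(0, 1, 2, 3, 4), stay=0.99):
    """Return the most likely copy number per window (hypothetical toy model)."""
    n = len(states)
    log_stay = math.log(stay)
    log_switch = math.log((1 - stay) / (n - 1))
    # Expected coverage per state; a small floor keeps the rate positive for CN 0.
    means = [max(cn * diploid_mean / 2, 0.01 * diploid_mean) for cn in states]

    # Initialise with a uniform prior over states.
    score = [math.log(1 / n) + log_poisson(window_cov[0], m) for m in means]
    back = []
    for k in window_cov[1:]:
        new_score, ptr = [], []
        for j, m in enumerate(means):
            # Best previous state for arriving in state j at this window.
            cand = [score[i] + (log_stay if i == j else log_switch) for i in range(n)]
            best_i = max(range(n), key=lambda i: cand[i])
            new_score.append(cand[best_i] + log_poisson(k, m))
            ptr.append(best_i)
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [states[i] for i in path]
```

With invented coverage values such as `[16, 15, 17, 16, 8, 7, 9, 8, 8, 7, 16, 17, 15]` and a diploid mean of 16, the run of roughly halved coverage is decoded as a single-copy segment flanked by two-copy segments.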

Concordance calculations
We calculated two concordances in this analysis, considering the phenotype determined by serology as truth. The average concordance per blood group is defined as the overall percent of all correctly inferred phenotypes from genotype data for that blood group. For instance, for the Kidd blood group, this would be calculated as follows:

\[
\text{average concordance}_{\text{Kidd}} = \frac{\#\ \text{individuals with correctly inferred Kidd phenotype}}{\#\ \text{total individuals}\ (n = 100)} \times 100
\]

The antigen concordance is calculated per antigen and is defined as the percent of individuals with individual antigen expression correctly predicted by genotype inferences over the total number of individuals. Using the Kidd blood group as an example, antigen concordance for the Jk(a) antigen would be calculated as follows:

\[
\text{antigen concordance}_{\text{Jk(a)}} = \frac{\#\ \text{individuals correctly inferred as Jk(a+) or Jk(a-)}}{\#\ \text{total individuals}\ (n = 100)} \times 100
\]
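Both concordance measures reduce to the same computation applied at different granularity, which can be sketched as follows. The function names and the four-sample toy data are invented for illustration and are not taken from the study.

```python
def percent_concordance(truth, inferred):
    """Percent of individuals whose inferred call equals the serological call."""
    if len(truth) != len(inferred) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(t == g for t, g in zip(truth, inferred))
    return 100 * matches / len(truth)


def has_antigen(phenotype: str, antigen: str) -> bool:
    """True if the phenotype string, e.g. 'Jk(a+b-)', is positive for the antigen."""
    return f"{antigen}+" in phenotype


# Toy data (hypothetical): serology is treated as truth.
serology = ["Jk(a+b-)", "Jk(a+b+)", "Jk(a-b+)", "Jk(a+b+)"]
inferred = ["Jk(a+b-)", "Jk(a+b+)", "Jk(a+b+)", "Jk(a+b+)"]

# Blood group level: the whole phenotype must match.
kidd_concordance = percent_concordance(serology, inferred)

# Antigen level: only the status of one antigen must match.
jka_concordance = percent_concordance(
    [has_antigen(p, "a") for p in serology],
    [has_antigen(p, "a") for p in inferred],
)
jkb_concordance = percent_concordance(
    [has_antigen(p, "b") for p in serology],
    [has_antigen(p, "b") for p in inferred],
)
```

In this toy example one individual is miscalled for Jk(a) only, so the blood group and Jk(a) concordances are both 75% while the Jk(b) concordance is 100%.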

Results
In 100 Omani blood donors, we compared blood group phenotypes determined by serology (Table S1) and by inference from genetic variants called from whole genome sequencing data for a commonly used serology panel including ABO, Rh, Kell, Kidd, Duffy, Lewis, Lutheran, P1, and MNS (Table 1). Using the most common variants underlying the 24 antigens tested, we evaluated the concordance for each blood group system as well as for each antigen (Table 1). Across all blood group systems, the average concordance was 96.9% (Table 1). Two blood group systems had a phenotype-genotype concordance of 100% (Kell and Kidd). The remaining seven blood groups had a concordance greater than 95%, with the exception of the MNS and RHCE C/c blood group systems, which had concordances of 91%.
We identified a total of 22 discordant samples (Table S2). Among these, we identified seven known variants in eleven samples affecting antigen expression in the Rh, Duffy, Lewis, and MNS blood groups that were previously undescribed in the Omani population (Table 2). We also identified a putatively novel variant in the Lewis blood group as well as three variants in transcription factor binding sites specific to erythrocyte expression or erythropoiesis that could be altering antigen expression in the Lutheran and P1 blood groups (Table 3). Additionally, we identified a structural variant in the glycophorin gene region that may alter S antigen expression of the MNS blood group (Table 3). There are eight discordances without a candidate novel variant that remain unresolved, all of which are in the Rh and MNS blood groups. We discuss the discordances and the identified variants for each blood group in detail below.

ABO Blood Group
The A, B, and O antigens are encoded by the ABO gene on chromosome 9. Using rs8176747 (Gly267Ala) and rs8176746 (Leu265Met) to infer the A and B antigens and rs8176719 (Thr87AspfsTer107) to infer the O antigen resulted in a concordance of 98%. Because Thr87AspfsTer107 is most commonly found on a haplotype that expresses the A antigen 26, the 13 individuals heterozygous for all three SNVs were inferred as blood type B. However, the variants are not in complete linkage disequilibrium (D' = 0.8) and one of these individuals was found to express A by serology.
The variants are too far apart for physical phasing, but this sample was imputed as carrying Thr87AspfsTer107 on the same haplotype as the alleles encoding Ala and Met, consistent with the A blood type. The other discordant sample was called as homozygous for Thr87AspfsTer107 and thereby inferred as O blood type but expressed the B antigen via serology. Further investigation revealed that this individual had one read with the insertion. Sanger sequencing confirmed that this individual was in fact heterozygous for Thr87AspfsTer107, resolving this discordance.

Rh Blood Group
For the Rh blood group, we typed the D, C, c, E, and e antigens encoded by the adjacent RHD and RHCE genes. The presence of the RHD gene on chromosome 1 results in the expression of D antigen, whereas homozygosity for a complete RHD gene deletion is the most common cause of the Rh D negative phenotype 27. Serology and genotype inference were discordant for the D antigen in one individual. This sample was inferred as D+ by sequence data, supported by numerous reads mapping to the RHD gene, but serologically the D antigen was not detected. This individual was found to carry the RHD pseudogene allele (RHD*) that consists of a 37 bp duplication in exon 4, which introduces a premature stop codon resulting in early truncation of the RHD gene 28. The allele frequency (AF) of RHD* in the Omanis is similar to the AF in the African/African American population in gnomAD v4.0 29 (Figure 1), with AFs of 0.0376 and 0.0389, respectively. This is higher than in the gnomAD v4.0 Middle Eastern population (AF = 0.0022), indicating heterogeneity across the Middle East, likely due to variation in African ancestry 19.
The E and e antigens, determined by alternative alleles at rs609320 (Ala226Pro), had 99% concordance. The discordant sample, inferred as E-e+ but serologically reported as E+e+, was found to carry an allele known to cause weak E expression, rs141398055 (Arg201Thr) 30.
Arg201Thr has the highest allele frequency in Middle Eastern populations (AF = 0.0036) in gnomAD v4.0 and a frequency of 0.015 in the Omanis (Figure 1, Table 2).
The C antigen results from the presence of RHD exon 2 sequence in the paralogous location in RHCE (likely due to gene conversion), which can be detected as misalignment of reads to RHD exon 2. We initially used the approach implemented by Lane et al. 13, comparing RHCE exon 2 coverage to the coverage of the RHCE locus. This resulted in a concordance of 91%. However, since the reads should be misaligning to RHD exon 2, we then compared the coverage across exon 2 of both genes in individuals that were hemizygous or homozygous for RHD. To account for hemizygosity, we adjusted the coverage ratio ranges to > 0.67 for C-c+, 0.1-0.66 for C+c+, and < 0.1 for C+c-. Using this approach, the C and c antigens had 99% concordance. The discordant sample was inferred as C-c+ but serologically reported as C+c+ (Table S2); it remains unresolved because this individual is D negative, so the second approach could not be applied.
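The adjusted exon 2 comparison can be sketched as below. The direction of the ratio (RHCE exon 2 coverage divided by RHD exon 2 coverage) is an assumption that is consistent with the stated thresholds but is not spelled out in the text; the function name, argument names, and the handling of values between 0.66 and 0.67 are likewise hypothetical.

```python
from typing import Optional


def classify_c_antigen_exon2(
    rhce_exon2_cov: float, rhd_exon2_cov: float, rhd_state: str
) -> Optional[str]:
    """Infer C/c from exon 2 coverage in RHD-positive samples only."""
    if rhd_state not in ("null", "hemizygous", "homozygous"):
        raise ValueError(f"unknown RHD state: {rhd_state}")
    if rhd_state == "null":
        return None  # approach is not applicable to D-negative samples
    if rhd_exon2_cov <= 0:
        raise ValueError("RHD exon 2 coverage must be positive for RHD carriers")
    # Assumed ratio direction: reads from a C allele misalign to RHD exon 2,
    # lowering RHCE exon 2 coverage and raising RHD exon 2 coverage.
    ratio = rhce_exon2_cov / rhd_exon2_cov
    if ratio > 0.67:
        return "C-c+"
    if ratio >= 0.1:
        return "C+c+"  # also catches values between 0.66 and 0.67
    return "C+c-"
```

Returning `None` for D-negative samples mirrors the limitation noted above, where the one remaining discordant sample could not be evaluated with this approach.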

Duffy Blood Group
The genotype-phenotype concordance and resolving variants for the Duffy blood group in this dataset have been previously published in an analysis of genetic ancestry and positive selection at the Duffy blood group locus, ACKR1 19. Briefly, inference of Fy a and Fy b antigen expression using rs12075 (Gly42Asp) and rs2814778 for the erythrocyte silent (ES) allele, Fy ES, resulted in a concordance of 96%. We found that three discordant individuals, genetically inferred as Fy(a-b+) but serologically reported as Fy(a-b-), carry the Duffy X allele, Fy X (rs34599082 Arg89Cys), which results in weak Fy b expression and had not previously been described in Oman 31. The fourth discordant individual was also genetically inferred as Fy(a-b+) and serologically reported as Fy(a-b-), but they did not carry the Fy X allele. Instead, we identified a two base pair frameshift resulting in early protein termination (rs773692057 Ser62fs) carried by this individual, together with the Fy ES allele, causing the Duffy negative phenotype. The frameshift allele is rare but present in additional individuals from Oman and other populations in the Arabian Peninsula 19.

Lewis Blood Group
Two loci must be considered when inferring phenotypes of the Lewis blood group. The secretor, Le(a-b+), and non-secretor, Le(a+b-), phenotypes are most commonly determined by a nonsense mutation (rs601338 Trp154Ter) in FUT2 on chromosome 19 32, whereas the null phenotype, Le(a-b-), is caused by a variety of different alleles in FUT3 that encode nonfunctional transferases, regardless of FUT2 genotype 33. We inferred the secretor phenotype using Trp154Ter in FUT2 and the null phenotype using three different SNVs in FUT3 that have been associated with Le(a-b-) in an Iranian population (rs28362459 Leu20Arg, rs812936 Arg68Trp, and rs778986 Met105Thr) 33. This resulted in a concordance of 96%. When expanding to consider additional SNVs known to cause the null phenotype, we found that thirteen Omani individuals carried rs3894326 (Ile356Lys), a SNV commonly used for genotyping FUT3 in European, South Asian and East Asian populations 34,35 (Figure 1), leading to a revised concordance of 99%. The remaining discordant sample was inferred as Le(a-b+), but the serology reported them as Le(a+b+), a rare phenotype indicating a functional FUT3 allele but a weak secretor allele at FUT2. This individual carried a unique missense variant, rs373779096 (Ala335Thr), that is very rare in gnomAD v4.0 but primarily found in individuals of African/African American or admixed American ancestry (AF = 0.00037 and 0.00025, respectively) and in a single Middle Eastern individual (AF = 0.00017).
Ala335Thr is located in the same exon as two other known weak secretor alleles (rs1047781 Ile140Phe and rs532253708 Met99Leu) that reduce enzymatic activity of alpha(1,2)fucosyltransferase [36][37][38], and therefore may represent a new weak secretor allele, though further confirmation is needed.
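The two-locus logic described above can be sketched as a small decision function. This is an illustration under stated assumptions: both loci are modelled as recessive (two nonfunctional FUT3 haplotypes for the null phenotype, two copies of the FUT2 nonsense allele for the non-secretor phenotype), the inputs are allele counts that have already been derived from phased genotypes, and the function and argument names are hypothetical.

```python
def infer_lewis(fut3_null_haplotypes: int, fut2_nonsecretor_alleles: int) -> str:
    """Infer a Lewis phenotype from counts (0-2) of FUT3 null and FUT2 nonsense alleles."""
    for count in (fut3_null_haplotypes, fut2_nonsecretor_alleles):
        if count not in (0, 1, 2):
            raise ValueError("allele counts must be 0, 1, or 2")
    # FUT3 is checked first: the null phenotype occurs regardless of FUT2 genotype.
    if fut3_null_haplotypes == 2:
        return "Le(a-b-)"
    # Assumed recessive model: non-secretors carry two FUT2 nonsense alleles.
    if fut2_nonsecretor_alleles == 2:
        return "Le(a+b-)"
    return "Le(a-b+)"
```

A model of this shape can never return Le(a+b+), which is why an individual carrying a weak secretor allele is inferred as Le(a-b+) unless that allele is added to the rule set.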

Lutheran Blood Group
The Lutheran blood group is encoded by the BCAM locus on chromosome 19. Using rs28399653 (Arg77His) to infer expression of Lu a and Lu b antigens, there was 99% concordance.
The discordant sample was inferred as Lu(a-b+) by genotype but serologically reported as Lu(a-b-). The presence of Lu(a-b-) is consistent with the frequency observed in the previous study in Oman 12, and suggests a higher frequency than elsewhere [39][40][41][42][43][44][45]. The Lu(a-b-) phenotype can either be due to homozygosity for loss-of-function alleles or expression of the BCAM gene below the level of detection by serology. The latter, referred to as In(Lu), is more common and has been attributed to heterozygous mutations affecting the transcription factor EKLF 44. We looked for additional variants within the BCAM locus, in EKLF, and in seven other erythroid transcription factors using the UCSC Genome Browser and JASPAR transcription factor track 44,[46][47][48]. We did not identify any loss-of-function alleles carried by this individual in BCAM or in EKLF. However, we did identify two adjacent SNVs falling in a GATA1 binding site for SPI1 that are unique to this individual and are thus candidates for a new allele encoding an In(Lu) phenotype (rs533045163 and rs184739796). Neither SNV is reported in Middle Eastern individuals from the gnomAD v4.0 database. Although rare, they are most common in African/African American individuals (AF = 0.006 for both SNVs).

P1 Blood Group
We inferred the P1 and P2 phenotypes of the P1 blood group using four intronic variants in A4GALT on chromosome 22 reported to be associated with P1 antigen expression: rs66781836, rs5751348, rs8138197, and rs2143918 52, although the causal variant remains unknown 53.
Phenotypes inferred with rs66781836 had the lowest accuracy, with 97% concordance. The other three almost always occurred together and had a concordance of 99%, suggesting rs66781836 is less likely to be the causal variant affecting P1 antigen expression, consistent with previous results 54,55. The discordant sample, inferred as P1, was heterozygous for all three SNVs despite the serology reporting them as P2. Given that one of the three most likely causal variants, rs5751348, falls within an intronic transcription factor binding site 55, we investigated other transcription factor binding sites within the A4GALT locus. We identified one potentially causal variant that falls within a STAT1 transcription factor binding site. This variant at position 22:42,721,266 is absent from gnomAD v4.0 and dbSNP, suggesting it is extremely rare. However, two Omanis were found to be heterozygous: the discordant sample and a concordant sample homozygous for all three reference alleles serologically reported as P1.

Discussion
Here, we comprehensively document the alleles underlying common blood group antigens in an Omani population by comparison of whole genome sequencing to serology. We demonstrate high concordance for all commonly tested blood group antigens in routine transfusion practice and report several recognized alleles altering blood group antigen expression that have not previously been described in Omanis. Notably, we describe alleles resolving discordances in the Rh, Duffy, Lewis, and MNS blood group systems (Table 2) in three or more unrelated individuals, suggesting these alleles are relatively common in the Omani population. Thus, these alleles should be considered for inclusion in red cell genotyping methods used in blood banks in Oman.
Additionally, although singletons in this dataset, GYP.He(P2) and the 2 base pair ACKR1 frameshift (Ser62fs), identified in a previous study with these samples 19, should also be included given their resulting null phenotypes. These two alleles were also observed in other Arabian Peninsula and greater Middle Eastern populations, indicating they are likely present throughout the region (Figure 1) 19. Inclusion of additional populations from this region would provide a better understanding of how prevalent these alleles are and their relevance to red cell genotyping methods in other Arabian Peninsula populations.
While the overall blood group concordance was high and we were able to resolve discrepancies with this approach (revised overall concordance = 98.8%, Table 4), this analysis also revealed limitations to inferring MNS, ABO and Rh blood group phenotypes from whole genome sequencing data. The whole genome sequencing data had an average coverage of 16X and a read length of 150 bp, which we found led to mismapping in the glycophorin gene region and difficulty inferring M/N antigen expression, as previously suggested 13. An instance of the ABO O allele (Thr87AspfsTer107) being miscalled as a homozygote rather than a heterozygote also suggests that deeper coverage could improve blood group inferences using commonly typed SNVs. Lastly, most of the unresolved discordances are in the MNS and Rh blood groups, likely given the complexity of these loci. Long-read sequencing or alignment to a pangenome, which can improve identification of structural variants 56,57, may be necessary for reliable inference of these blood groups from DNA sequence.
Although discordant samples were not fully resolved in the MNS, Lewis, Lutheran, and P1 blood groups, we identified putatively causal variants that warrant further investigation. The Dantu SV identified in one of the discordant MNS samples has previously been believed to express the s+ antigen 51. However, we did not identify any other potential causal variants within GYPB exons or cis-regulatory element binding sites that could have caused the S+s- phenotype in this individual. Within the Lewis blood group, we identified a SNV that could result in the weak secretor phenotype. Presently, only two weak secretor alleles have been described (Ile140Phe and Met99Leu) [36][37][38], which reduce enzymatic activity and fall within the same exon of FUT2 as the allele we identified, Ala335Thr. We also identified a candidate regulatory variant in a GATA1 transcription factor binding site of the erythrocyte-specific transcription factor locus, SPI1, that could cause the In(Lu) phenotype as a result of reduced transcription of SPI1 58, akin to the associated SNVs in a GATA1 binding site of EKLF 44. Similarly, a candidate causal allele that falls within a binding site for the erythrocyte-specific transcription factor STAT1 in the 5' UTR of A4GALT may result in the absence of P1 antigen expression.
In conclusion, our findings document the alleles underlying common blood group antigens in Omanis and highlight the importance of considering population history when evaluating variants for blood group genotyping panels. For instance, the RHD* indel, the erythrocyte-specific null allele in the Duffy blood group, and the GYP.He(P2) allele are most common in African populations. These alleles were found in multiple Omani blood donors from this dataset and their prevalence in the population is consistent with what is known about the shared genetic ancestry of the Omanis with East African populations 19,59,60. Overall, this study emphasizes the necessity to increase population representation in genotype databases and indicates that whole genome sequencing paired with serology is a valuable approach for doing so in transfusion practice.
Tables
Table 1. Blood group system and antigen concordances from the comparison of blood typing by serology and inference from the commonly typed genetic variants. Variant information and allele frequency in this dataset are shown in the last two columns.
†Average concordance is calculated as the overall percent of correctly inferred individuals (n = 100) predicted by genetic variants per blood group.
*Antigen concordance is calculated as the percent of individuals with each antigen expression correctly predicted by the genetic variants.

Table 2. Summary of known variants resolving discrepancies between inferred phenotypes from whole genome sequence data and phenotypes reported by serology in the Omani samples.

Table 3. Summary of putatively novel causal variants identified in the discordant Omani samples.

Table 4. Adjusted per blood group concordances when including variants that resolved discrepancies (Table 2) between inferred phenotypes from whole genome sequence data and phenotypes reported by serology in the Omani samples.

Figure 1. Allele frequencies of known blood group variants newly described in the Omani population (OM) compared to frequencies reported for Middle Eastern (ME), African (AFR), South Asian (SAS), and European (EUR) populations in gnomAD v4.0.

Figure 2. Alluvial plots showing the discordances in the MNS blood group. The left side of each plot shows phenotypes determined by serology, and the right side shows the phenotypes inferred from whole genome sequence data. The colors correspond to phenotypes determined by serology. A) 88% concordance for the M and N antigens. B) 97% concordance for the S and s antigens.

Figure 3. Copy number inference across the glycophorin genes based on read coverage. The x-axis corresponds to positions across the glycophorin gene region on chromosome 4. The vertical black bars indicate the genes from left to right: GYPE, GYPB, GYPA. The y-axis is labelled by each sample in the Omani dataset inferred as having a structural variant, except for HG02554, which is a 1000 Genomes sample known to carry the Dantu structural variant. The dotted horizontal gray lines separate the inference for each sample. The colors correspond to the number of gene copies, with white regions indicating two copies (no copy number variation relative to the reference).